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MODaiNG or A LATENT FAULT 
DETECTOR IN A DIGITAL SYSTEM 


By Pivyllis M. Na^iel 

Vo ugh t Cm'pQ)'ation Hampton Taclinkal Coivter 


ABSTRACT 

Methods of model tnq the decoction time on latency peclocl of a hand- 
ware fault in a digital system are proposed that explain how a commitor 
detects faults In a computational mode. Tlie oh.iectives wei'o to study 
Ijow software I'oacts to a fault, to account for as many variables as possible 
affecting detection and to forecast a given prooram's detectino abilitv 
prior to computation. A series of experiments was conducted on a small 
eiiiiilated iiiio'oprocessor with fault injection capability. Results imiicato 
that the detectiiw capability of a program larooly depends on the instruc- 
tion subset used during computation and the frequency of its use and lias 
little direct dependence on such variables as fault mode, number set, 
degree of branching and program length. A model is discussed which employs 
an analog with halls in an urn explain the rate of which subseouent 
repetitions of an instruction or instruction set detect a oivon fault, 
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INTROnUCTIOM 


The concept of coverat^e as an Important variable in the 
reliability assessment of fault tolerant computer systems has long 
been recogni ted (ref, 1,2), Coverage, in effect, provides a measure 
of the chances of a system's recovery in response to a hardware fault, 

The determination of coverage for a given system largely depends 
on two variables: The time to detection or latency time of a 
fault during which the computer continues its computational task 
undisturbed, and the reconfiguration time, niven detection, during 
which the computer must isolate the fault and implement the recovery 
strategy of the system. Of the two the latter is the easiest to understand 
and is the most intuitive to the system desioner and conseouently is easier 
to model realistically in reliability calculations. It is no surprise, 
however, that realistic models of detection time are difficult to find. 

The variable is highly dynamic, not only fault deoendent, as is recon- 
figuration time, but also dependent on the type and scheriulino of the 
detectors detecting the fault and on the computational burden of the 
entire system. 

Models of coverage usually model the time to detection, in terms of 
a function of one or more random variables reflectinn the characteristics 
of the detector or detectors used in sensino the fault. CARE II (ref. 3, 

4) contains, by far, the most careful development of the mathematical 
interaction of these variaL-les by introducino the concept of competing 
detectors on fault classes. Unfortunately the use of this model in 
assessing the reliability of a specific system is handicapped L\v the complete 
lack of data with which to model and forecast the behavior of a niven 
detector beyond the realm of the educated guess. 


To roctiTy this doficionov ond undorstoiul I'lorp oomnloh'ly tlm 
nature ot latent faults, the present r(v>oarch has selected the cemearator' 
voter t’oi' detailed consideration. The importance of this detector 
is undeniable in that the use of votino across two or more channels as 
a detnctor of faulty output is basic to the desion of every redundant, 
reoonfieurable computer system. The problem of evaluatino the comparator/ 
voter as a detector is not an easy one howeve>', in that if is not one 
of evaluatino the effiv:ienc,v and performance of a particular piece of 
hardware. Since the result of a vote is based on the outnut of a 
prooram, the entity beino evaluated is the capacity of a pronram to 
detect hardware faults in a computat ional mode. 

The importance of a computation-based .inalysis in the calculation 
of coverage was recooni^ed by Never (ref. Pl. There he aroues that if 
reliability calculations are to reflect the computational needs of tlie 
user in the definition of system success then computation-based measures do so 
more accurately than standard, structure-based analysis. Thus, the present 
investigation concentrates on computation-based detection of harvlware faults 
in an effort to dain further insioht into the interaotion between structure and 
task as it influences covewuie. 

The workino hypothesis aovernino the evperiments eresented in tins 
paper is that different proorams witli varvinn proacam features such as 
the deoree of branchinu, the number of instruotions executed the type of 
instructions executed, the number set u.>ed in ccmmitatien, etc,, var\ 
both in their capacity to detect and in their rate of detection, Nhat 
then does this detection capabilUy depend on for a oiven nroeram and 
can it be foi'ocasted prior to computation from physical features of the 
program itself? 

The existence of a propram iiiiplfes the existence of a system within 
wliich it is to operate and to accurately evahjate these ouestlons the pronram 
should be investinated as it preforms in its computational environment. 

Since large diagnostic general purpose emulators do not exist, the properties 
of software detection were, explored undoi' less iiynamic conditions and the 
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results that follow are relative to this less than rGallstio environment. 

The experiments were conducted on a dloiinostlc emulator with fault 
injection capability at the f]ate level currently under development by 
the Aircraft Electronic Systems Branch nt NASA, Lanoley, The emulator 
was programmed to emulate a vei'y simple processor with thirteen Instruc- 
tions referred to in this paper as the very simple processor or VSP. 

Programs were written with this instruction set, run In the presence of randomly 
Injected gate faults and data collected on the accuracy o^ the output. 

EXPERIMENT 

The instruction set for the very simple processor contains the 
following thirteen instructions; 

*feteh and *store 
'‘'add and *subtract 
shift right and shift left 
AND and OR 
indirect addressing 
overflow indicator 
*branch 

copy to and from temporary stoiMoe, 

Six programs were written in the lamuiage and are described below. The 
results of the analysis that follows wore based on the output of simulatinn 
the first five programs and the output of the sixth was used as .a confir- 
mation case. As, a control device all six programs were coded using only 
the five starred instruction. 

1. Fibonacci (FIB) - Creates a sequence such that any member is the 
sum of the preceding two members starting with a pair of random 
initial values. Eight members of the sequence are generated. 

2. Fetch and Store (F&S) - Fetches a number from memoiw and stores 
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it 1n another location. This process is repeated eight times. 

3. Add and Subtract (ASS) - Four subtractions and four sums are 
computed alternately from values in memory. 

4 , Search and Compute (S&C) - Two random numbers are chosen from a 

list of the first twenty numbers and identified by a search. 

The 20 X 20 square region is divided vertically and horizontally 
by a random division D and additional computations performed in the 
areas indicated to form three separate branches j 


I 

III 

II 

u 



Branch I - a simple count 

Branch II - a subtraction and a count 

Branch III- a subtraction and a multi- 
plication 


A correct run was determined by a correct identification and a 
correct branch computation. 

5. Linear Convergence (LC) - A line with a random slope and intercept 
is adjusted in slope so as to cross the x-axis prior to a predetermined 
X value, xj. Once crossed, its deviation from the x-axis at Xi is 
minimized over slope, From the point on the line just optimized at 
Xj a new line of opposite slope is obtained by repeating this process 
at new value X 2 . Iterations are continued until a given number of 
computer cycles has elapsed. A successful run was one that completed 
this number of cycles without error. 



SAMPLE OUTPUT - LC PROGRAM 
5 


6. Quadratic (QUAD) - Computes the value of various quadratic polynomials 
. of the type Ax^ - Bx - C where A, B and C are positive integers and 
-10 1 X £ 10. For a given run, four sets of the four initial values 
are selected at random and four computations performed. 

A program was simulated by running it N times with random input, each 
run in the presence of a different single gate fault selected at random 
uniformly over the gate list. For each run the fault was injected prior to 
computation and the fault mode was determined by treating input and output 
faults, stuck-at-1 and stuck-at-0 as equally likely alternatives. 

It was not the intent of this investigation to explore faults in the 
voter/comparator but to evaluate how software reacts when executing in the 
presence of a hardware fault. Thus comparisons were made not b'/ voting over 
two or more copies but by comparing the output of the simulation to a correct 
value achieved either by hand calculation or by a fault free run of the same 
program under the same initial conditions, A record was then kept of the 
number of runs having faulty output for sample sizes that varied from 97 to 
211, The table below contains a summary of the recorded results. 


Program 

Sample Size 

Detections 

Estimated 

Detection Probability 

Estimated 

Standard Deviation 

FIB 

211 

98 

.464 

.034 

F&S 

118 

42 

.356 

.044 

A&S 

208 

117 

.563 

.034 

S&C 

118 

64 

.542 

.046 

LC 

133 

78 

.586 

.043 

QUAD 

97 

55 

.577 

.050 


Simulation Results 


Note the imprecision in the estimates of detection probability as measured 
by the standard deviation. This suggests that in general deviations from the 
given estimates by at least four in the second digit are still quite likely. 
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Initially a sample s1r.c of approximately ino runs per pronram was 
selected as optimal with reaard to time and budnet. Later two of the 
programs FIB and A&$, were extended to approximately twice that amount 
in order to evaluate to some extent the effect of sample size on the 
stability of the estimate of detection probability. 

DATA AMALYSL*^ 

»* 

Once the detection probabilities were obtained it was anticipated 
that variations between them could be explained in terms of variations 
in program features such as t!ie number of executed insti'iictions, the 
number of different instructions used in compufeatloi^, the denroe of 
branching, the number set, etc. The significant features could thwa be 
used to predict tlie performance of a niven pronram's canacity to detect 
hardware faults in the processor. 

The first table of the two that follow contains. a breakdown of 


several 

program features for 

each of the first 

five programs 

and the 



second 

summarizes the instruction set utilized 

during computation. 



• 

No. of Program 

No, of Executed 

Memory 

No. 


No. of 

Program 

Statements^, 



Uortjons 

^,ze 


Rj'ajichcs 

FIB 

12 

33 

17 

0 to 

85 

0 

F&S 

18 

IB 

27 

0 to 

85 

0 

A&S 


20 

35 

0 to 

85 

0 

S&C 

344 

151.5 (Avg.) 

375 

~50 to 

200 

3 

LC 

318 

Random 

341 

.t200 


Random 


Program Features 
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Pi^i'ain ^.CA/Vll Si|byb\i^ 

FIB X X X X 

■F&S X X 

A&S XX XX 

S&C X X X X X 

LC X X X X X 

Instruction Set 

Using linear regression, variations in the probability of detection 
were explored as a function of the entries for the first four programs in the 
table of Program Features. Since every combination of vai'iables considered 
by these methods produced at least one negative regression coefficient, this 
data did not begin to explain the source of variation in the results of the 
simulations . 

In contrast when the information in the Instruction Sot table 
was investigated a more consistent signal emerged. The following 
sections explore the nature of this signal and consider the question of the 
dependence of the probability of detection on the individual variables of 
the feature table In more detail. 

Instruction Set 

The primary difference in instruction set between the FIB program 
and the F&S program is the add instruction. With the addition of the add 
instruction in FIB to the fetch and store instructions in F&S the corresponding 
probability of detection Jumped from ,3Bd to .46'1. Similarly wlien a subtract 
instruction was added in AX.S to the instruction in FIB the probability jumped 
from .464 to .667 and remained approximately at this level (M2) when tlie 
instruction set was held constant in 




To vloterniino U' thosp »1lftVi‘otU’os o>*u roal or liuo to stattsticol orror 
sovoral statistiool to^ts of niootficonoo wero ooiuiuctotl. Ffjst a tost for 
equality betwoon the detection prol>3f nity of FIB and that of FSS was rodoctod. 
Thus the addition of the add instruction in FIB to those of FSS increased the 
detection probabtittv hv a siqnifiv'ant amount. Similai'ly a test for equality 
between the detection probability of FSB and ASS was rejected so that the 
inclusion of the subtract instruction increased the detection capability 
significantly. Since SSC and ASS use the same instruction seti a test for 
equality between their detection probabili ties should not reject if the overall 
thesis that detection primarily depends on the instruction set is valid. This 
was indeed the case. The estimated probabilities for these two programs do 
differ of course but the results of the test i?idicate that if fchei'o are real 

differences they are still buried in statistical error and tiuo’efore are 
much smaller than Hie differences between ASS and FIB for e'sample. 

brior to simulating the l.C program the early results from A&S were 
combined witli those from ssc to form, the estimate 


This was used as a point estimate te foiwast the behavior of the If pregiMm 
since the instruction set for l.O. is approvimately the same as that for ASS 
and SStf. Acknowledging that censiderable error still existed in this forecast 
due to sample size, a ‘HV-. confidence interval was computed as an interval 
estimate. Thus the interval 

[.Bob ‘ 1 .pB \ ct.BlB, .&:]] 

was the actual forecast for the detection probability of the U' program. The 
results of the IC. program based on IBB runs show a detection probability of 
.‘oBB which is well within the prediction intorval . Using all subsegueiU runs 
of A&S changes the interval to 
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C.Si)B * l.Gli ^ (.51?, 603] 

wivich stm valiviates the predietloiu 

IF the toy tiiicrofn’oopssoi' ‘Jhould be esiiDloyed as a serious cuiiiputatlonal 
device it would now be possible to forecast the detection probability of any 
program utilisiiw all or part of this iiistruetion 'ct of five instructions. 

The least squares estimate of that part of the detection probability due to 
the fetch and store Instructions, which includes as well the effect of those 
faults whose detection is comnon to all instructions, is .356. The additional 
contribution to the probability due to the inclusion of an add is .106 and 
the additional due to a subtract is .100. A total point ostimato for a program 
utilisimj all five instructions Is .504 

In the preceding discussion the role of the branch instruction was not 
established as its effect is not clear. Statistically this instruction has 
ijero effect as a predicto)' detection probability. It seems quite plausible 
that the effect of this Instruction over and above the othe)' foin* instruction? 
to which it is very closely related is so small that it cannot be separated from 
the random fluctuation still present in the data at these sample sices. 


fault Mode 

Of interest in studyiuii the properties of software fault detection is a 
determination of the strength of the relationship between detection and fault 
mode. That is are input faults more detectable than output, or .ce S-a~l faults 
more easily detected than s-a-OT Statistical chi-square tests of independence in 
contingency tables were performed on all programs for ..-.tection versus I/O, 
detection versus s-a-1 , s-a-0 ami detection versus the combined signal, fione of 
the s-a-1, s-a-0 tests were significant nor were any of the test conducted on the 
combined signal. Two of the I/O tests rejected independence namely FIB and 
S&C, the rest did not. Of tiie two that implied dependence the dependence was 
in opposite directions, that is output faults were more likely to be detected 
by the FIB program than input faults and for SSC the effoci was reversed. 
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When the samples from all fwe programs were combined the dependence in I/O 
was no longer significant, All tests were conducted at the level of 
significance. 


Number Size 

• * 

The dependence or independence of detection on number size Is difficult 
to measure directly from these experiments in that number size Is not a 
control variable and consequently expands, expands and contracts or changes 
at random throughout the program. By selectively sampling the data, however, 
new experiments can be defined which provide some information on the nature 
of this relationship. These experiments were conducted on the three essentially 
repetitive programs FIB, F&S and A&S. 


The F&S program performed each iteration on an independent random number 

having full octal range (0-8^), The experiment on number size consisted of 
matching a run that detected the fault at random with a run containing a 
fault that was not detected. For each pair the numbers executina at the time 
detection occurred were recorded for the detected run and its undetecting 
companion. Run by run differences across the pairs were then computed and 
this set of differences formed the basic sample. Tliis same procedure was 
followed for the A&S program except that for each iteration two independent 
numbers are involved in the computation instead of one. Thus the differences 
were taken between the averages of the two numbers executing at the time of 
detection for the detected run and its paired partner. 

It was then hypothesized that if number size was a significant factor 
In detection the mean value of these differences should shift away from zero. 

A statistical "t" test was conducted on this mean for each program testing for 
zero versus non zero, Neither test showed a significant difference from zero. 
It can therefore be concluded that there is no significant difference in runs 
that were detected and those that were not with respect to number size. 
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Fop fib, nui'tbep si^e is vi function of the siso of the two ramioii! initial 
inputs and therefore a sliiihtly different experiment was defined. First the 
averages were sorted by whether detection occurred, ci*eatiiui a sample from 
each of two populations. A test was conducted for equality of the two moans 
versus inequality as aecin it was hypothesiaed that if detection depended 
on number size there should be a shift away from equality. The test sinewed 
no significant difference between them. The density for each sample is 
plotted below. 



5 10 IB 

Avg. N'o. Size (lOO's) - FIB 



Branchlruj and Program Length 


The S&C program contains 3 distinct branches that can be ranked short, 
medium .and long as the average number of executed statements Is 42.3, 113.9 
and 305.3 respectively. This provides a means of testing if the branch effect 
is signiricant. The detected and nondetected runs were categorized by branch 
to form a 2X3 contingency table. A test for independence was conducted and 
failed to reject thereby implying that there is no strong evidence supporting 
the hypothesis that detection depends on branch 1n this case. 

Another test was conducted on this data to see if there was any dependence 
of detection on the number of statements actually executed during the running 
of the program, The number of executed statements ranged from 14 to 432. The 
data was divided into lOO's and a probability of detection calculated for each. 

The results are given below. 


1 . of Statements 

No . of Samples 

No. Detected 

Probabil ity 

0 - 99 

56 

27 

,482 

100 - 199 

22 

15 

.682 

200 - 299 

15 

11 

.733 

300 - 399 

19 

10 

.526 


Program Length vs. Detection - S&C Program 


The four data points over 399 were omitted from consideration. It was 
hypothesized that if there was no dependence of detection on the number of 
executed statements these probabilities should all be statistically equal. 

A chi-square test conducted on these four proportions assuming equality 
against all alternatives failed to reject. Thus at these sample sizes there 
is no evidence to support the contention that the detection is dependent on 
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the iniiubop oF expcuteij statonu'nts. This In further HubntanklatoU wlion it 1s 
notocl that the itita From on i?» all t'roin branch HI so that oven though 
computational (liscrepanov in minimal, the proportions, thmuili ciifferont, are 
contrary to expectations. 

When other programs are ronsiiioreil with respect to the iniinbor oF statements 
executed the same confusion in evident. The FSiS program takes about half as 
many executions an FIU to detect .d as many Faults. A&S, on the other Imnd, 
takes about .V an many oxecvitionn as Fill to detect 1 .F timon as many Faults. 

Iletection Time 

The sipnal comiiuj from the number of ntatoments executed 1s confounded 
with the sipnal comim] From the nature of the statements beitui executed, Thus 
it may be more rennonalvle to treat the entire program an an entity and attempt 
to predict its performance as a whole. 

When proportion of detection 1s plotted apainAt the number of times the 
program has repeated itseVf at the time of detection it is apparent that tliore 

is a dependence. The three programs A{vd, Fill and F&S are repetitive or nearly 
repetitive and provide a means for evaluating tlie natuv'e of this dependency. 

The following table gives the mimber of Failures fo\‘ each of these programs as 
a function of wliich repetition the program was executing when detection occurred. 
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at tha Numl'ai* at I aaks 


SovaiMl ittadi'ls hava I'aaii invasHaahHl tn an attaiiipt ta vliaaaataa1,‘a lha 
efficlanav of OalaaHan with aaaaaO (a aapaataO laaks at a lault, Onl.v tl\a 
two at tluMii liisaussaO balaw appaaa ta aUaouvttalv a\p1ain tha baliaviat' Otsplavoxl 
in tha ahava taMa tram hath a mimaaivwl anO intiiitna paint at vtaw, lha 
t'ipst 1s hasaO an a movlal paapasavt In OAlU U \aai‘. 0, 41 that tlina ta Oatoat, 
foa tha v'onipnraiaa'vataa apaaatino avMitlmiaush , Is axpanantlaUv Oistpllnitaa 
with v\ constant imilHpl»’aa Oanatlnn tha avaaall paahah!llt> at Oataation. Tiia 
sacond util leas an analoo.v at' halls in an \irn ta ntovial tha wav in which a 
ppQviaaui datacts a fault whaaa oanaaatino a fault Is aoulvalant ta vaav'hlnu Into 
tha urn and mark Inn a hall. Still anatliar phanamanan ralatad ta this vianaral 
nuastian is cited In tha sactian an tha canf Irmat lai\ praarvun. 


I \panantia1 Mavial 


Modifvinn tha assumptians af tha OARV 11 mavlt'l sllnhtlv ta rafhvt that 
till pronrams waaa tarminatad aftar anlv clnht itaratlansv laads ta tha 
fatlawiiHi structura an tha dansltv fumtlan af 
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where t is the time of detection measured in repetitions and T is the 
truncation time of the test, in this ease oipht! 
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There are two reasons why a fault may not he detected under the assumptions of 
this model; oi\e, because the Fault may not be detectable by the program in 
question (denoted by a) or two, because sufficient time has not elapsed for the 

fault to be seen (denoted by cO. Q measures vx, the probability of a fault 

' «\T 

remaining unclotoGted for all time, measures B, the probability of poinp 

umletected because of iobufficient time or attention by the program, ai»d 
is the stand alone detection probability. 

Maximum likelihood estimators for the paramctei’S and \ of this model 
reduce to solvinu the following pair of simultaneous transcendental equations; 


P^ l n/N 

1 ’ ■ ‘ 


where 

n 

5 number o 

f runs terminatinq in a detection, 


N 

I iiumber o 

f runs 

and 

t) « detect io 

n time in terms of repetitions. 



Frequency of Detection i%) 
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By solving the second equation Iteratively for \ and substituting Into the 
first the following table was computed, 


Program 


\ 

\(p£ 



A&S 

.56a 

.577 

1 .02 

.432 

.006 


.474 

,491 • 

1,04 

.526 

,009 

F&S 

.371 

.396 

1 .07 

.629 

.015 


MLE Estimates - Exponential Model 

It is Intei’cstlng to note that even though there 1s wide variation In the Individui 
GstiiiiQtes of pQ and there 1s remarkable stability in their ratio. The 
final two columns give estimates for the probability that the fault will remain 
permanently undetected by the program and the probability that the fault Is 
not detected duo to test truncation, respectively. Plots of this function 
superimposed on histograms of the detection data for each of the thi'ee 
programs appear below. 



Time to Detect (Repetitions) 
Exponential Model 
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Urn Model 


Though the previous niodel explains much of the variation in detection 
time, it only provides infoniiation on the rate of detection and not on the 
mechanism of detection. For this reason a new model is proposed in this 
section which explores, by means of an analogy with balls in an urn, the 
question of what a program experiences while it is executing ie a faulty processor 
and, once determined, provides a method of forecasting a program's detection 
efficiency as a function of time. 

Let S be the set of all gate states for a given processor, For this 
example then, S is the set of all tn'plets of the form (x\ , xa, xj) where xi 

is the gate, Xj designates its use as an input or an output gate and X;^ 

denotes its value. Let A be the subset of all gate states in S that are 
encountered during repeated use of a given progi'am, If, by analogy, the set 
S ’'s a set of balls in an urn consisting of two colors, say red and blue, 
representing the sets A and A’, respectively, then generating a random fault 
is equivalent to reaching into the urn and marking a ball, The probability 

that the ball is red is simply the stand alone probability of detection for 

that program designated in the exponential model. 

To complete the analogy it will be assumed that executing a program is 
equivalent to reaching into the urn and withdrawing a handful of red balls, 
since the program only exercises gate states in its detectable set. Each 
successive repetition of the program is modeled by assuming that the with~ 
drawn halls are replaced in the urn after every draw, and the siaes of the 
draws arc statistically the same. Instead of boing entirely independent, how- 
ever, the draws share some balls iivcoimnon. That is, from repetition to repetition 
the assumption is that some gate states are common to every computation 
performed on that program and some are not. 

These assumptions lead to the following model for the probability of 
detecting a fault on the kth dixaw, Pj,: 

P^ = PPy k «1 

P,. “ 0 - P)(l - ‘’’I 

IS 



where is as before, the stand alone probability of detection, P is the 
probability of detection on the first draw given that the draws are from the 
red balls, and n represents a probabilistic measure of the unshared pate 
states. Thus the probability of detection on the first draw is the probability 
of wi thdrnwing the marked ball given that the ball is )'ed, times the probability 

that a red ball is marked. The probability that the marked ball is withdrawn 
on the second draw is (1~P)^, Pq, where the (1-P)^, factor now represents the 

conditional probability that the marked ball is not withdrawn on the first 
draw, times the probability that it is selected on the second in that part of 
the draw unshared by the first, given that the marked ball is red. 


Modifying this distribution slightly to account for the fact that the 
test was truncated at k=j{l, the distribution for y, where y-inin (k,S) and 
k is the time of observed detection, becomes 


A 


h (y) H 


{i-p)(i~p)y‘'” 

P- r. {l-P)(l-r)^ ^ p+ Q 
O i=9 ° 


y « 1 
2 ly 

y = 8 Q « 1-P 

J Vq I < q 


el sewhere 


By assuming that | (1-P)(1 is 2 ero, appro.ximnte MLE estimates 
for pQ, P and p can be obtained for each of the three programs A^iS, FIB and 
FSS by solving the equations 

Fq = n/N 
P = t\i/n 

where n, N and are defined as in the exponential model and 

III = number of runs terminating in a detection 
during the initial repetition. 
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Results of tliese comnutatlons are Foimil in the table belowj 




r 
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A&S 

.bbd 

,bbd 

,dB!> AM 

.010 

RB 

.•Ibb 

.)hM 

.die ,Wb 

,oih 

F&,S 

,,lbb 


.ddd ,ivl4 

.0,1,1 


Mi.i: tst 

imates 

Urn Model 



Plots of this model appear as dashed eiieves in the graphs that follow, 
Though the function and tl\e observed data piotted with it are both discrete , 
the points have been conneeted with straioht lines for better visobilitv. 

It should be noted that the estimate for P^^ forces the model to eoineide with 
the data at the first repetition. Since, too, thei'e aro three free parameters 
to estimate in this model instead of only t\w In the exponential model , tlie 
fib is appreciably better. 



Time to Petoet tRopetitions) 
Urn Model 


,'0 


The quality of the fit, however particularly with regard to explaining the fall 
off at t “ 2 is consistent and appears to provide an explanation for detection 
not contained in the exponential model. 

It is premature to predict the performance of the parameters of the model 
in any, definitive way except to note certain broad features. In particular it 
is interesting that P is a fairly stable parameter over all three programs 
and that the variation between programs, in addition to that already discussed 
earlier with regard to the differ'^etices in overall performance as measured by 
pQ, is also evident in p, In particular A&S and FIB are the two programs which 
have been run at sample sizes large enough to nearly stabilize the estimate of 
pQ. For these programs 


Pq(A&S) p(A&S) 

That is the p's are in the same ratio as the P^'s correct to two decimal 
places. This fact is not true, however, when these programs are compared 
to F&S and attributing this to the remaining instability in the estimates of 
F&S can only be conjectured. 


Confirmation Program 

After data on the first five programs had been collected and a preliminary 
model formulated, it became apparent that the data for the S&C and LC programs 
did not entirely support the model with regard to rate of detection, If the 
model on rate was correct, subsequent repetitions of these programs should 
detect more faults and thus Increase their overall probability of detection. 

At the same time, however, the estimated overall probability of detection was 
already at the predicted level based on the first repetition of these programs 
only and hence under this prediction subsequent repetitions should detect very 
little. Since, too, both of these programs are much more complicated than those 
previously analysed, it is conceivable that neither model is correct and that 
further explanation is in order. Because of restrictions on the number of 
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cycles for the V5>P, it won not prootioal to inteiui either of these two pronrains 
to more repetitioiiH, Thus, to resolve this apparent contradiction si Nth program 
QUAD was written and simulated as an example of a more complicated proviram 
runninii in repetition. 

Since the Ql'AD program uses all ol the five previously defined instruc- 
tions, the data from the ASS, S.Sl' and \ C programs were combined to form a point 
estimate of .S64 as a forecast for the behavior of QUAD. The overall detection 
probability of QUAD based on a sample siae of o? is Not only Is this value 

close to the predicted value but it is also cUise to the point estimate .bb.i 
for AStS thouoh ASS is a hiohly time dependent detector. Wlien the -data for QUAD 
was sorted by repetition, b.' of tlie bb detected runs detected on the first 
repetition, three detected on the third and one detected en the fourth based 
on a test truncation time of 4. That Is, almost all of the faults causino 
faulty output did so the first time the propram looked at the fault; a 
phenomenon confirmtnp previous results obtained from tl\e 10 and bSa' proprams. 

Thus, tliouph the overall level of detection is predictable, there is little 
or no dependence of detection on future looks at the same fault when proevyuns 
pet complicated, 

Thoupli at first Thoupht these results are surprisinp, they oeceme less so 
wheivtlie complexity of the preprams are studied with the assumptions cf the 
rate model in mind, that is, when tliev are interpreted by the sasie model they 
appear to violate. A structure common to the 1,0 and QPAD eroaram is their 
repetitive use of comparison and muUipl ication loops to .lo miuPi of the 
computation. Since both of these instructien sets involve the use ot all five 
of the instructions under consideration, their use is repetitive but insi.le 
the propram. As a result the cimjmilative effect ef detection is sensed b\ the 
propiMin but is only tapped after tsany repetitions for each of the instructions 
have been excercised. Thus it is the intepral that is observed .ind not t!u' 
repetition. In effect, this leads te a new model of detection based not only 
on the instruction set but on the frepuenev with which a piven instruction is 
used durinp compuration recopnicinp that detection is a decayinp turiction ot 
frepuoncy but with nearly constant rate. 


Suggestions for Continued Experimentation 


Several Issues have been raised In the preceding discussion that could 
be validated somewhat more completely if additional experimentation was 
conducted. Whether they should be continued on the VSP or await a more realistic 

computer model Is difficult to assess 1n that 1t depends on the credibility 
of the small computer. In any event the following list contains suggestions for 
continued experimentation. 

1. The prediction method for predicting stand alone probability 
is severely hampered due to the statistical error in the 
estimated probability. Therefore several programs need 

to be run to large sample sizes in order to sharpen the 
forecasting tool , 

2. Forecasting the parameters of the urn model requires new 

• programs run to larger sample s’izes as suggested In 1. 

These would provide data for predicting the level and 
stability of P and the dependence of p on P^. 

3. The Interactive model for more complex programs involving 
the dependence of detection on instruction set and their 
frequency of use during computation is not definitive. A 
series of controlled experiments could be designed to 
investigate this dependence. 

4. When the Interactive model 1s completed it could be tested 
for Its prediction capability by varying the fault Injection 
time. 

5. The emulator now has a memory with fault injection capability 
Though it is expected that the degree of modeling complexity 
is no way near as complicated as that for the processor, 
simulations could be conducted to determine if this is so. 

6. When a processor is used as a component in a fault 
tolerant flight computer, tasks will be run back to back 
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with 1 ‘opetitions of one program interspersed with repetitions 
of others. As a result it is not sufficient to study a 

program in Isolation but in tandem with other programs. 

As a first step this issue could be e.\plored using two 
programs in tandem that have already been studied in 
isolation. 

7. Recent infomation has indicated that pin failure instead 
of gate failure is a more realistic model of IC failure, 

Some of these experiments should be repeated to explore the 
consequences of tliis information, 

CONCLUSION 

The intent of this investigation has been to propose methods of modeling 
the duration and extent of latent faults by exploring the way a computer 
detects faults in a computational mode, In particular the desire was to account 
for as many variables as possible affecting detection and to use them in fore" 
casting a given program's detecting ability prior to computation. 

This investigation has shown that relative to the simplified version of 
a microprocessor used in these experiments, detecting capacity of a program 
largely depends on the Instruction subset used in computation and the frequency 
of its use and has little dependence, if any, on such variables as fault mode, 
number size* branching and program length. 

For simple programs that control the use of a given instruction or instruc- 
tion set, two models are explored that show a decaying dependence of detection 
on repetitive use of the program. The most interesting of the two models 
employs a simple analogy with balls in an urn, to explain both the mechanism 
of detection and the rate at which subsequent repetitions of the program detect 
a given fault. For more complicated programs, the data supports the contention 
that where repetitive use of an instruction is excercised during computation, 
detection is equivalent to sensing the cummulative effect of repeated exposure 
to the fault. Thus complex programs detect more the first time they see a 
fault and repeated looks provide little additional information. 


24 


Though these results should be regarded as exploratory In part because 
of the computational environment and 1n part because there is still a consider 
able amount of statistical error in the data, nevertheless they should be 
considered with some seriousness, The picture they provide of the computing 
pracess is both reasonable and self consistent and it is conceivable that with 
a larger facility emulating a realistic processor, the methods and results 
presented here can provide insight into a more general theory of computation 
based detection, 


Vought Corporation Hampton Technical Center 
Hampton, Virginia 23666 
June 15, 1978 


25 


REFERENCES 


1» BauHclus, W, G*5 Cartel’, W. C.; and Sclmelder, P. R,: Reliability 
Modeling Techniques for Self-Repairing Computer Systems. Proc. ACM 
Annual Conf., pp 295-309, 1969. 

2. Bavuso, Salvatore J,: Impact of Coverage on the Reliability of a Fault 

Tolerant Computer. NASA TN D-7938, 1975. 

3. Raytheon Co., Equipment Development Laboratory! Reliability Model 

Derivation of a Fault-Tolerant, Dual, Spare-Switcliing, Digital 
Computer System. NASA CR-1 32441 , 1974. 

4. Raytheon Co., Equipment Development Laboratory: An Engineering Treatise 

on the CARE II Dual Mode and Coverage Model , NASA CR-1 44993, 1976. 

5. Meyer, John: Computation-Based Reliability Analysis, IEEE Trans, on 

Computers, vol. C-25, No. 6, 1976, pp 578-584, 


26 


I lll»(Kjft Ni» 

NASA CIHA!).)/1 

>) Itiin iuul Niibiiiiti 


i Ut<v(i(Mnii<iti Aui*ssi(>n No 



Molh'l int) ol ii hUiMit I ljult notoctoi’ in a IVItjtt.nl Systniii 


[‘hylHs M. NiUiol 

t) I'l ittlKDIMIJ iVijilllt.'illlllll NjiIU' ilHil AllllM’M 

Voimhl; t'orpmsU ion 
llainpion iivlinkiil ii'nlor 
[lamp I Oft, V try in la 

I* ‘■I'lnuiiihlij Ayi'iu > N.liim .iml AiM(i‘»v 

Nattoiial Aononaut il's aiul Spaoo AilininisliMtion 
Washfmtton, l)i: :'i)S.Ki 


.1 Ht** • , Viitiiiiiy Ni* 

S ttl'IKIIl itilltf 

Si'pli'nihor l‘)/M 

II I'litiUMiiitij |lry.tiii!titi»it r»ilo 
II iVHtirintni) lli'ixvt No 

1(1 Work Oiiil No 


II I'linlioi I <ir til, nil Nil 

NASMil'.iOO 

1,1 I vi'<> I’t lli'imK iiiiit IVrioil luvrinil 
Coni iVKFkir Ro)ioi’l. 

III ‘i|Umsiiimij CihIii 


NASA lodniital Nanation; S. J. liaviiso 

111 Aliilijit 

Mollunis ol iiioiloMiui llio tli'toclion Miik' ui' laionry poi'kul of a liat'tJwai'o laull 
ill a dii|i(a] sysli'iii am pruposi'tl tliat oxplain luiw a niiiiputor ilotocln iaiiUs In a 
ninipuialional iiioilo. Dio ohjnct ivtvs worn lo stiiiiy liow soltwaro i'o,U‘ts to a lauK, 
to account lor as many variahU's as possilVIo arroctinu dotoctioii ami to lormast 
a qivrn proipMin's ilotoclimj ahillly prior to coiiiputation. A st>rio‘, o1 oxprrinu'nt s 
was comlui.ttHl on a Mitall isiiulatml tnicropm’ossor with faiilt injoction capaldiiv. 
Ki'siilts iiulicati' thal tiu' ilt'toctinn capability ol a protirain iariirly dopriuK on tlu‘ 
instruct ion suhsi't iisril iliiriiiti coiiipiitat Imi and tno tmiuoncy ol Us usr ami has 
Httio diroct drpriidt'iicr on such variahirs as iault iiiodt', iiumboi' sot, dourrr ol 
braiichimi amt proqraiii lomitli. A imutol is dfscussml which oniploys an aiialoi) with 
balls in an urn to oxplain tho tsiio ot which subsoiim'iit ropolitions of an insirmlion 
or instruction srt dotrt t a ulvi'ii taiilt. 


I? Ki'y WnriU (SinlitiSIt'il hy Aiitliiutlll 

I a tout laut t rrl iahii i ty 

Iault Covi'caqo 
tlotocl ion 


T' ' " ' ' ' 

1)1 |ii'<lnlmtiiiii Sliili'imMit 

liiirlassl t'iod • llnlimitiHl 


ID liiMiniy Clima (ol ttiii ii'isnil 

.’0 Si’i unty t'liissil (iil tins ikiiM) 


No ol l’.ii)i>s 



uncjassj ripd. 


Ui 


